Video Scene Parsing with Predictive Feature Learning: Supplementary Material
Abstract
In this supplementary material, we provide more implementation details, including the architectures and training settings of the baseline models and PEARL. We also present experimental results and analysis of PEARL on the CamVid dataset, as well as more qualitative evaluations of PEARL.

1. Implementation Details

Since the class distribution is extremely unbalanced in video scene parsing, we increase the weight of rare classes during training, similar to [3, 6, 15]. In particular, we adopt the re-weighting strategy of [15]: the weight for class $y$ is set as $\omega_y = 2^{\lceil \log_{10}(\eta / f_y) \rceil}$, where $f_y$ is the frequency of class $y$ and $\eta$ is a dataset-dependent scalar defined by the 85%/15% frequent/rare-classes rule (a small numerical sketch of the weight computation is given at the end of this section). All of our experiments are carried out on NVIDIA Titan X and Tesla M40 GPUs using the Caffe library.

1.1. Network Architectures

To demonstrate that PEARL can be applied with advanced deep architectures, we implement PEARL and the baseline models on two state-of-the-art deep architectures, i.e., VGG16 and Res101, and compare their performance. For a fair comparison, both the frame parsing network and the predictive learning network in PEARL share the same architecture as the baseline models, except for the input/output layers of the predictive learning network (which takes multiple frames as input and outputs RGB frames instead of parsing maps; a toy sketch of this input/output change is also given at the end of this section). In the following, we first introduce the architecture details of the baseline models (i.e., the VGG16-baseline and Res101-baseline below) and then the differences between PEARL and these baselines.

Figure 1: Architecture of the global contexture module, which encodes global image context as suggested in ParseNet [12]. In our experiments, the output feature map of the fc7/conv5_3 layer of the VGG16/Res101 architecture is passed through this module to produce a global-context-augmented feature map via global average pooling, up-sampling, and concatenation with the fc7/conv5_3 output feature map. A 1×1 convolutional layer is then applied to the concatenated feature map to produce output features with 1,024 channels.

• VGG16-baseline. The VGG16-baseline is built upon DeepLab [2] with two modifications. First, to further enhance the model's ability for video scene parsing, we add three deconvolutional layers (each followed by ReLU) to up-sample the fc7 output features of DeepLab. The three deconvolutional layers use 4×4 kernels with stride 2 and padding 1, and have 256, 128 and 64 kernels, respectively. Second, following ParseNet [12], we apply the global contexture module to the fc7 features to enhance the model's ability to capture global context information. As shown in Figure 1, the global contexture module transforms the input feature map into a 1024-channel feature map. In experiments, we find this module improves the parsing performance of the baseline model, as it enlarges the model's receptive field and exploits global information to distinguish locally confusing pixels. (A sketch of this head appears after this list.)

• Res101-baseline. The architecture of our Res101-baseline is illustrated in Figure 2. It is modified from the original Res101 [5] by adapting it into a fully convolutional network, following [14]. Specifically, we replace the average pooling layer and the 1,000-way classification layer with a fully convolutional layer (denoted as conv5_3cls in Figure 2) to produce dense parsing maps. (A sketch of this conversion also appears after this list.)
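To make the re-weighting rule concrete, here is a small Python sketch of the weight computation. It is our illustration, not the authors' released code; in particular, the exact procedure for deriving $\eta$ from the 85%/15% rule is not spelled out in the text, so the sketch simply takes $\eta$ as a given input.

```python
import numpy as np

def class_weights(freqs, eta):
    """Per-class loss weights w_y = 2^ceil(log10(eta / f_y)).

    freqs -- per-class pixel frequencies f_y (e.g., summing to 1)
    eta   -- dataset-dependent scalar; the paper derives it from an
             85%/15% frequent/rare-classes rule, so we take it as given
    """
    freqs = np.asarray(freqs, dtype=np.float64)
    return np.power(2.0, np.ceil(np.log10(eta / freqs)))

# Toy example with three classes: the rare class gets a larger weight.
f = [0.70, 0.25, 0.05]
print(class_weights(f, eta=0.25))  # -> [1. 1. 2.]
```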
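The input/output difference of the predictive learning network can be illustrated with a toy snippet. The text only states that this network takes multiple frames as input and outputs RGB frames instead of parsing maps; stacking the frames along the channel axis, the frame count T, and the layer shapes below are all our assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: stack T input frames along the channel axis and
# regress 3 output channels (an RGB frame) instead of per-class scores.
T = 4                                                         # assumption
input_layer = nn.Conv2d(3 * T, 64, kernel_size=3, padding=1)  # 3T -> 64, not 3 -> 64
output_layer = nn.Conv2d(64, 3, kernel_size=3, padding=1)     # RGB, not parsing map

frames = torch.randn(1, 3 * T, 224, 224)   # T stacked RGB frames
pred = output_layer(input_layer(frames))   # (1, 3, 224, 224) predicted frame
```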
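For the VGG16-baseline head, a minimal PyTorch-style sketch follows. The paper's implementation is in Caffe, so the module boundaries, the tiling-based up-sampling inside the contexture module, the ordering of the two modifications in one forward pass, and the final 1×1 classifier are our assumptions; the 4×4/stride-2/padding-1 deconvolutions with 256, 128 and 64 kernels and the 1024-channel context output come from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContextureModule(nn.Module):
    """ParseNet-style global context (Figure 1): global average pooling,
    up-sampling back to the input's spatial size, concatenation with the
    input feature map, then a 1x1 conv down to 1024 channels."""
    def __init__(self, in_channels, out_channels=1024):
        super().__init__()
        self.proj = nn.Conv2d(2 * in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        g = F.adaptive_avg_pool2d(x, 1)                # global average pooling
        g = g.expand(-1, -1, x.size(2), x.size(3))     # "up-sample" by tiling (assumption)
        return self.proj(torch.cat([x, g], dim=1))     # concat + 1x1 conv

class VGG16BaselineHead(nn.Module):
    """Head on the fc7 features of DeepLab: global contexture module, then
    three 4x4/stride-2/padding-1 deconvs (256, 128, 64 kernels), each
    followed by ReLU, and a final 1x1 classifier (our assumption)."""
    def __init__(self, fc7_channels=1024, num_classes=11):
        super().__init__()
        self.context = GlobalContextureModule(fc7_channels)
        deconvs, in_c = [], 1024
        for out_c in (256, 128, 64):
            deconvs += [nn.ConvTranspose2d(in_c, out_c, 4, stride=2, padding=1),
                        nn.ReLU(inplace=True)]
            in_c = out_c
        self.upsample = nn.Sequential(*deconvs)
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, fc7):
        return self.classifier(self.upsample(self.context(fc7)))

# Example: 40x40 fc7 features -> 320x320 scores (three stride-2 deconvs = 8x).
head = VGG16BaselineHead(fc7_channels=1024, num_classes=11)
scores = head(torch.randn(1, 1024, 40, 40))  # -> (1, 11, 320, 320)
```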
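The Res101-baseline conversion can be sketched in the same spirit. The torchvision backbone and the layer slicing below are our substitutions for the paper's Caffe model; only the idea, replacing the average pooling layer and the 1,000-way classifier with a convolutional classifier (conv5_3cls) that emits dense per-class score maps, comes from the text.

```python
import torch.nn as nn
from torchvision.models import resnet101

class Res101BaselineFCN(nn.Module):
    """Fully convolutional Res101: keep the backbone up to the last
    residual block; replace avg-pool + 1000-way fc with a 1x1 conv
    classifier (the paper's conv5_3cls) producing dense parsing maps."""

    def __init__(self, num_classes=11):  # 11 = CamVid classes, for example
        super().__init__()
        backbone = resnet101(weights=None)
        # Drop the trailing avgpool and fc layers; keep conv1 ... layer4.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.conv5_3cls = nn.Conv2d(2048, num_classes, kernel_size=1)

    def forward(self, x):
        # The stock Res101 has output stride 32; the paper follows [14] for
        # the fully convolutional adaptation, so strides/dilations may
        # differ in the original model.
        return self.conv5_3cls(self.features(x))
```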